Questionnaires measuring movement behaviours in adults and older adults: Content description and measurement properties. A systematic review

Background Sleep, sedentary behaviour and physical activity are constituent parts of a 24h period, and there are several questionnaires to measure these movement behaviours. The objective was to systematically review the literature on content and measurement properties of self- and proxy-reported questionnaires measuring movement behaviours in adults and older adults. Methods The databases PubMed, CINAHL, PsycINFO and SPORTDiscus were systematically searched until April 2021. Articles were included if: the questionnaires were designed for adults and older adults; the sample size for validity studies was at least 50 participants; at least both validity and test-retest reliability results were reported for questionnaires that were developed specifically to measure the amount of sleep, sedentary behaviour or physical activity, or their combination; and the articles were written in English, Spanish, French, Portuguese, German, Italian or Chinese. Findings and conclusions Data extraction, results, studies' quality, and risk of bias were evaluated using the Consensus-based Standards for the selection of health Measurement INstruments (COSMIN) guidelines. Fifty-five articles were included in this review, describing 60 questionnaires. None of the questionnaires showed adequate criterion validity and adequate reliability simultaneously; 68.3% showed adequate content validity. The risk of bias for criterion validity and reliability was very low in 72.2% and 23.6% of the studies, respectively. Existing questionnaires have insufficient measurement properties and frequent methodological limitations, and none was developed considering the 24h movement behaviour paradigm. The lack of valid and reliable questionnaires assessing 24h movement behaviours in an integrated way precludes accurate monitoring and surveillance systems of 24h movement behaviours.


Introduction
In light of the recent 24h movement behaviour paradigm [1], sleep, sedentary behaviour (SB) and physical activity (PA) are constituent parts of a 24h period that interact and influence health. This new paradigm has led some countries, as well as the World Health Organization (WHO), to develop 24h movement guidelines [2][3][4]. As such guidelines are developed and launched in other countries, there is a tangible need to accurately assess movement behaviours in an integrated way, and monitoring and surveillance systems will need to be adapted to assess compliance with them. The accurate assessment of movement behaviours is also essential for research, policy, and practice. Despite the advantages of objective methods to assess movement behaviours, such as accelerometery (e.g., they do not depend on participant recall), in large epidemiological studies and clinical settings self- or proxy-reported questionnaires are often preferred, given their practicality, simplicity, affordability, and low burden for participants (in terms of time and acceptability) [5][6][7]. Moreover, questionnaires are capable of gathering valuable contextual information about the behaviours (e.g., domains, settings, types) that objective measures are unable to capture [8]. Nevertheless, assessing 24h movement behaviours is challenging and complex, given that movement behaviour questionnaires are often prone to measurement error and reporting bias due to misreporting, whether because of social desirability bias or cognitive issues related to recall or comprehension [9].
The usefulness of a self-reported measure is dictated by its qualitative attributes (i.e., content validity) and psychometric properties, such as test-retest reliability and criterion validity. As such, questionnaires must be adequately developed and described, presenting adequate content and measurement properties, because if the development method and the measurement properties are weak or not extensively known, the risk of misclassification, biased and unreliable results is high [10].
The self-reported assessment of movement behaviours has generally been done by assessing each behaviour per se and consequently, evidence of the content analysis and measurement properties of the instruments used to assess these behaviours has also been done in isolation. Recently, two systematic reviews [11,12] on measurement properties of PA questionnaires reported several limitations, particularly related to statistical methods and accelerometery interpretation; and that the methodological quality of the studies could be improved by increasing sample size, enhancing statistical procedures and reporting methods, and choosing better comparison measures for validity studies. Regarding SB, two other systematic reviews [13,14] reported poor levels of agreement and accuracy with under and overestimation of total time spent in SB. Altogether, these reviews indicate that precise self-report instruments to measure PA and SB are still scarce [15]. Concerning sleep questionnaires, these seem to be primarily used as a diagnostic tool and to be relatively accurate [16]. Despite the reduced accuracy when compared with diaries and objective instruments, questionnaire-based data is considered relevant due to the importance of each person's self-perception about their sleep [16]. However, it is unclear whether there are questionnaires assessing sleep considering it as part of a 24h period (i.e., as a movement behaviour).
The fact that movement behaviours have traditionally been subjectively assessed individually (each behaviour per se), ignoring the intrinsic and empirical interactions between them [17,18], may partly be because there is no single questionnaire that assesses 24h movement behaviours in an integrated way. Selecting the best questionnaire for each movement behaviour (or their combination) is difficult, given the high variability in their content and the inadequate measurement properties. This has been documented in previous reviews [11,12,14,16,19,20]. However, none of these reviews assessed the questionnaires that measure the combination of these behaviours at the same time. Therefore, reviewing the questionnaires measuring all the movement behaviours, individually or in combination, in adults and older adults, is necessary. In this context, we aimed to systematically review the literature on content and measurement properties of self- and proxy-reported questionnaires measuring the movement behaviours or their combination, in adults and older adults.

Information sources and search strategy
A systematic search of the electronic databases PubMed, CINAHL, PsycINFO and SPORTDiscus was conducted in April 2021, covering records from inception until April 2021. Additional studies were identified by manually searching references of the retrieved papers.
The electronic databases were searched for variations of the terms 'PA', 'SB', 'sleep', 'movement behaviours', 'questionnaire' and 'measurement properties'. A supporting file shows this in more detail [see S1 File]. The search terms used for 'measurement properties' were the ones proposed by COSMIN guidelines [21]. The search terms were adapted for each specific electronic database to ensure the quality of the systematic searching (e.g., in PubMed's case, MeSH terms were used when applicable).

Eligibility criteria
Consensus-based Standards for the selection of health Measurement INstruments (COSMIN) guidelines for systematic reviews of patient-reported outcome measures [21], were adapted to the purpose of this review and followed. The COSMIN guidelines are in concordance with the Cochrane Handbook for systematic reviews of interventions [22] and the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) [23].
To identify and characterize valid and reliable self-reported or proxy-reported questionnaires assessing sleep, sedentary behaviour and physical activity, or their combination, the following inclusion criteria were defined: 1) participants were adults (≥ 18 years) or older adults (≥ 65 years), living in the community; 2) minimum sample size of 50 participants for validity studies [24]; 3) articles reporting at least both validity and test-retest reliability results [25] of questionnaires that were developed specifically to measure the amount of sleep, SB or PA, or their combination; 4) articles written in English, Spanish, French, Portuguese, German, Italian or Chinese.
The exclusion criteria were the following: 1) articles that used doubly labelled water as gold standard for validity purposes, given that doubly labelled water assesses total energy expenditure, not only PA energy expenditure and, as such, it has been considered an unreliable criterion measure for PA levels [11,25]; 2) articles reporting measurement properties of instruments that aimed solely to predict or detect a given health condition, were designed for special populations (e.g., chronic, auto-immune and infectious diseases, sleep disorders, athletes, pregnant women) or focused only on lifetime PA; 3) articles reporting measurement properties of questionnaires that were not designed to validate an original questionnaire (e.g., reported linguistic validation); 4) articles reporting measurement properties of logs, diaries or interviews of movement behaviours; 5) grey literature (e.g., policy reports, government documents, working papers, conference proceedings, theses and books or book chapters), reviews, meta-analyses, cost-effectiveness studies and commentaries.

Study selection process
Three authors (BR, JE and EVC) independently screened articles by title, abstract and full text. Results were cross-checked and disagreements were resolved by discussion with a fourth author (RS), until consensus was reached. Reference lists of identified articles were also reviewed to ensure that no relevant articles were overlooked. These processes were conducted using the CADIMA software [26].

Data collection process and data items
A standardized data extraction form was created to record relevant information from the included articles about the questionnaires' content, validity, reliability, measurement error and responsiveness. A supporting file shows this in more detail [see S2 File].
Given the characteristics of this review, the data extraction on content and measurement properties was based on the COSMIN guidelines [21], the Taxonomy of Self-reported SB Tools (TASST) framework [13] and the Quality Assessment of PA Questionnaire Checklist (QAPAQ) [25]. For measurement properties, the Edinburgh Framework for validity and reliability in PA and SB measurement was also considered [27]. When needed, adaptations were made to integrate sleep as a movement behaviour. The measurement properties' definitions used in this study are presented in Table 1.

Table 1. Measurement properties' definitions.
Validity: The degree to which an instrument truly measures the construct(s) that it is intended to measure, free from all possible sources of error or bias.
Convergent validity: The extent of the agreement with another (non-criterion) measure that should assess the same behaviour parameter based on face and content validity.
Criterion validity: The extent of the correlation between a measure and another already considered as being a criterion or gold standard.
Reliability: The extent to which an instrument gives consistent, stable, and repeatable measurement. In other words, the extent to which it is free from measurement error.
Test-retest: The extent to which test scores are consistent from one test administration to the next, keeping the same conditions (e.g., researcher, timing, preparation).

Study risk of bias assessment
The Risk of Bias checklist developed by COSMIN is intended exclusively for assessing the methodological quality of single studies included in systematic reviews of questionnaires [21]. Given the characteristics of this review, this checklist was adapted. The checklist herein presented has a 4-point scale (i.e., 'very low risk', 'low risk', 'medium risk' or 'high risk'), and contains items on criterion validity, reliability, measurement error and responsiveness. For each measurement property, different design requirements and statistical methods were rated based on the COSMIN standards. Each measurement property was evaluated separately. The overall rating was determined based on "the worst score counts" method as proposed by COSMIN. The criteria for each item can be found in COSMIN guidelines [21]. For reliability, as previously done [19], we defined an 'adequate' time interval between test and retest as follows: > 1 day and ≤ 3 months for questionnaires recalling a usual week/month; > 1 day and ≤ 2 weeks for questionnaires recalling the previous week; > 1 day and ≤ 1 week for questionnaires recalling the previous day; > 1 day and ≤ 1 year for questionnaires recalling the previous year. The data were collected independently by three authors (BR, JE and EVC) and disagreements were resolved by discussion with a fourth author (RS).
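For illustration only, the test-retest interval rule and the "worst score counts" method described above can be sketched in Python. This is not the tooling used in the review; the function names, the recall-period labels, and the conversion of months and years to days (3 months taken as 90 days, 1 year as 365 days) are assumptions made for the example.

```python
# Upper bound (in days) of an 'adequate' test-retest interval per recall period.
# ASSUMPTION: 3 months = 90 days and 1 year = 365 days (not stated in the text).
MAX_INTERVAL_DAYS = {
    "usual week/month": 90,
    "previous week": 14,
    "previous day": 7,
    "previous year": 365,
}

# Risk-of-bias levels of the adapted checklist, ordered from best to worst.
RISK_LEVELS = ["very low risk", "low risk", "medium risk", "high risk"]


def interval_is_adequate(recall_period: str, interval_days: float) -> bool:
    """True when the interval is longer than 1 day and within the upper bound."""
    return 1 < interval_days <= MAX_INTERVAL_DAYS[recall_period]


def worst_score_counts(item_ratings: list[str]) -> str:
    """Overall rating of one measurement property: its worst item rating."""
    return max(item_ratings, key=RISK_LEVELS.index)
```

Under these assumptions, a questionnaire recalling the previous week and re-administered after 7 days has an adequate interval, and a measurement property with item ratings of 'very low risk', 'medium risk' and 'low risk' receives an overall rating of 'medium risk'.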

Effect measures
2.6.1 Quality of measurement properties. To evaluate the studies' quality of measurement properties we followed the COSMIN guidelines; as such, all measurement properties were rated against quality criteria for good measurement properties [28]. Each result was rated as 'adequate' (+), 'inadequate' (-), or 'doubtful' (?) when design or method was not well reported (e.g., lack of information regarding sample characteristics, lack of information regarding criterion validity).
A study was considered to have 'adequate' criterion validity when results for correlations between the questionnaire and the criterion instrument were ≥ 0.70. The accelerometer was considered a criterion measure because, although there is no gold standard to measure all movement behaviours, the accelerometer is the only instrument able to do it with proven accuracy and is widely used as a criterion comparison measure in validation studies of movement behaviours' questionnaires [5].
For convergent validity, statistically significant correlations (p<0.05) between the movement behaviour and assessments related to the behaviour in question (e.g., between PA and VO2max) of ≥ 0.5 and correlations between the movement behaviour measured by similar self-reported instruments of ≥ 0.7 were considered 'adequate'.
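The validity thresholds above can be written as a short Python sketch. It is an illustration under stated assumptions rather than the procedure used in the review: the function names and the comparator labels ('related', 'similar') are hypothetical, the significance requirement (p<0.05) is assumed to apply to both kinds of convergent comparison, and the 'doubtful' rating for poorly reported designs is not modelled.

```python
def rate_criterion_validity(correlation: float) -> str:
    """Correlation with the criterion instrument of at least 0.70 is 'adequate'."""
    return "adequate" if correlation >= 0.70 else "inadequate"


# Minimum correlation per kind of convergent comparator (labels are hypothetical):
# 'related' = a variable related to the behaviour (e.g., VO2max for PA),
# 'similar' = a similar self-reported instrument.
CONVERGENT_THRESHOLDS = {"related": 0.5, "similar": 0.7}


def rate_convergent_validity(correlation: float, p_value: float, comparator: str) -> str:
    """ASSUMPTION: statistical significance is required for both comparator kinds."""
    threshold = CONVERGENT_THRESHOLDS[comparator]
    is_adequate = p_value < 0.05 and correlation >= threshold
    return "adequate" if is_adequate else "inadequate"
```

For example, a significant correlation of 0.55 would be rated 'adequate' against a related variable but 'inadequate' against a similar self-reported instrument.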
For reliability, Intraclass Correlation Coefficient (ICC) or weighted Kappa values ≥ 0.70 were considered 'adequate'; the use of Pearson or Spearman correlation coefficients was considered 'inadequate', because these do not take systematic errors into account [29]. However, Pearson and Spearman correlations > 0.80 were rated positively, similarly to what has been previously done [11].
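A minimal Python sketch of this reliability rule follows. The function name and the statistic labels are assumptions introduced for the example; only the thresholds come from the text.

```python
def rate_reliability(statistic: str, value: float) -> str:
    """Rate one reliability result by the type of coefficient reported.

    ICC or weighted Kappa: 'adequate' at 0.70 or above.
    Pearson or Spearman: 'inadequate' by default, but rated positively
    when the correlation is strictly above 0.80.
    """
    if statistic in {"icc", "weighted_kappa"}:
        return "adequate" if value >= 0.70 else "inadequate"
    if statistic in {"pearson", "spearman"}:
        return "adequate" if value > 0.80 else "inadequate"
    raise ValueError(f"Unsupported statistic: {statistic}")
```

The stricter cut-off for Pearson and Spearman coefficients reflects that they ignore systematic differences between test and retest, so a higher value is demanded before the result is accepted.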
Measurement error was considered 'adequate' when the smallest detectable change or limits of agreement (LoA) were smaller than the minimal important change, and 'doubtful' when the minimal important change was not defined.
Responsiveness was considered 'adequate' when the result was in accordance with the hypothesis or the Area Under the Curve (AUC) was ≥ 0.70, and 'doubtful' when no hypothesis was defined.
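As a hedged illustration of the measurement error criterion, the sketch below computes Bland and Altman 95% limits of agreement between test and retest scores (mean difference ± 1.96 standard deviations of the differences) and compares them with a minimal important change. The function names are hypothetical, and comparing the larger absolute limit with the minimal important change is an assumption about how "smaller than" is operationalised, not a rule stated in the text.

```python
from statistics import mean, stdev


def limits_of_agreement(test: list[float], retest: list[float]) -> tuple[float, float]:
    """Bland and Altman 95% limits: mean difference +/- 1.96 * SD of differences."""
    differences = [a - b for a, b in zip(test, retest)]
    bias = mean(differences)
    half_width = 1.96 * stdev(differences)  # sample standard deviation
    return bias - half_width, bias + half_width


def rate_measurement_error(test, retest, minimal_important_change=None) -> str:
    """'doubtful' without a minimal important change; otherwise compare the limits."""
    if minimal_important_change is None:
        return "doubtful"
    lower, upper = limits_of_agreement(test, retest)
    # ASSUMPTION: the larger absolute limit must be below the minimal important change.
    widest_limit = max(abs(lower), abs(upper))
    return "adequate" if widest_limit < minimal_important_change else "inadequate"
```

With no minimal important change supplied, the function always returns 'doubtful', whatever the limits of agreement are.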
For the overall rating of the quality of the studies, if 75% of the results per study were 'adequate', the overall rating was considered 'adequate'.
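The overall rating rule can be sketched in Python as follows. The function name is hypothetical, and two points are assumptions not stated in the text: the 75% threshold is read as "at least 75%", and any study falling below it is labelled 'inadequate'.

```python
def overall_quality(result_ratings: list[str]) -> str:
    """Overall study rating from the ratings of its individual results."""
    if not result_ratings:
        raise ValueError("At least one rated result is required")
    adequate_share = sum(r == "adequate" for r in result_ratings) / len(result_ratings)
    # ASSUMPTION: 'at least 75%' is adequate; anything lower is 'inadequate'.
    return "adequate" if adequate_share >= 0.75 else "inadequate"
```

One consequence of this rule is that the overall rating depends on how many scores a study evaluates: a questionnaire with a single 'adequate' score is rated 'adequate' overall, whereas one with four scores needs at least three of them to be 'adequate'.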
2.6.2 Content validity. Given the characteristics of our search strategy, we did not perform a comprehensive analysis of content validity, but rather applied a subjective reviewers' rating to assess the content validity of all included questionnaires, as suggested by COSMIN guidelines [21]. In this analysis, several aspects were evaluated as 'adequate' (+) or 'inadequate' (-), such as: 1) relevance of the items for the construct, population, and context of use (i.e., the item had to be directly related to the construct or behaviour evaluated); 2) appropriateness of the response options and recall period for the construct, population and context of use (i.e., closed response options were considered inappropriate because they do not capture the movement continuum; the recall period and context had to be clearly stated); 3) comprehensiveness for the construct, population and context of use (i.e., key aspects, such as duration or intensity related to the construct or behaviour, had to be clearly stated); and 4) language appropriateness of the response options and items (i.e., clear and simple language).
To evaluate content validity, if the questionnaire was not included in the article, we either contacted the authors to request the questionnaire or searched online to find it. If access to the questionnaire was not possible, we rated it as 'cannot be determined'.

Synthesis methods
We conducted a narrative synthesis of the results and organized it in the respective tables (as presented in the results section below).

Search results
The search yielded 16,182 articles after removing duplicates. Twelve articles were added after searches in other reviews. Based on titles and abstracts, 108 full texts were selected, and 55 were included, describing 60 questionnaires. The reasons for exclusion of full texts are described in Fig 1.
Regarding PA questionnaires, 68% assessed multi-domain PA, with leisure-time PA being the most frequent domain (measured in 19 out of 25 PA questionnaires included). The most prevalent response method was the continuous method (68%), focusing on different metrics (e.g., hours/day). The most frequent measurement unit was METs/hour or minute per week or minutes per day (44%). Most of the questionnaires (72%) assessed multiple scores. Recall periods varied from past year (24%), past week (52%), usual week (24%) to currently (12%). None of the questionnaires specified the assessment period (whether a participant is asked regarding a particular type of day, e.g., only weekend days). The number of items included in the PA questionnaires ranged from one to 74.
In the SB questionnaires [51][52][53][54][55][56][57][58][59][60][61], the most prevalent domains were total SB/sitting time (50%) and multi-domain (41.7%). The continuous response method was the most prevalent (66.7%), in hours per day (41.7%) and minutes per day (41.7%). The measurement units depended on the objective of assessment, and the most used score was total SB (91.7%). The most frequent recall periods were past week (33.3%) and usual day (33.3%). Assessment period was specified in 66.7% of the SB questionnaires. The number of items included in the questionnaires ranged from 1 to 20.
There was only one questionnaire assessing sleep duration (Behavioral Risk Factor Surveillance System (BRFSS) Sleep questionnaire) [62]. The response method was continuous, and the measurement unit was hours/day. The recall period was a usual day.
The questionnaires combining PA and SB [63][64][65][66][67][68][69][70][71][72][73][74], mostly assessed the behaviours through multi-domain (83.3%) with the occupational domain being the most prevalent (11 out of 12 questionnaires). The occupational domain was also used in single domain questionnaires [71,72]. The most prevalent response method was the continuous method (75%) focusing on different metrics (e.g., hours/week). The most prevalent measurement unit was time (75%) (e.g., hours/week) and several scores were evaluated in all questionnaires, rather than just one score. The most frequent recall periods were usual day/week (66.7%). Assessment period was not specified in 66.7% of the questionnaires. The number of items included in the questionnaires ranged from 3 to 75.
One questionnaire assessed both SB and sleep [75]. This questionnaire had 41 items assessing multi-domain behaviours, the response method was continuous, the measurement unit was hours/day with multi-scores evaluated and the assessment period was specified.
All except two [79,81] of the questionnaires measuring a combination of PA, SB, and sleep [67][76][77][78][79][80][81][82][83] assessed these behaviours through multiple domains. The most prevalent response method was the continuous method (77.8%), focusing on different metrics (e.g., hours/week). The most prevalent measurement units for SB and PA were energy and intensity variables (77.8%) (e.g., METs, kcals) and several scores were evaluated in all questionnaires. For sleep items, the measurement unit was always hours/day. The recall periods focused on the past (55.6%) and on the usual activity (44.4%). Assessment period was not specified in 77.8% of the questionnaires. The number of items included in the questionnaires ranged from 5 to 448. Among these questionnaires, none was designed, in terms of content and final scores, to assess all movement behaviours considering the 24h movement behaviour paradigm. The characteristics of the included questionnaires are presented in Table 2.

Content validity.
Table 3 presents the summary of the content validity results, and its details are provided in a supporting file [see S1 Table]. Most of the questionnaires (68.3%) showed 'adequate' content validity.
Regarding PA questionnaires, only three were considered 'inadequate' (two in adults [33,34] and one in adults and older adults [43]). Three questionnaires (in adults) [33,35,36] were not available; therefore, their content validity could not be determined.
For SB, three questionnaires (two in adults [55,56] and one in adults and older adults [61]) were considered to have inadequate content validity. One questionnaire [57] was not assessed as its content was not available.
The sleep questionnaire was considered to have 'adequate' content validity. For questionnaires combining PA and SB, three were considered to have 'inadequate' content validity (one in adults [63], one in older adults [68] and one in both [73]).
The SB and sleep questionnaire [75] was considered to have 'adequate' content validity. For PA, SB and sleep questionnaires, four were considered 'inadequate' (two in adults [76,80] and two in adults and older adults [82]). One questionnaire [81] (in adults and older adults) was not available; therefore, its content validity could not be determined.
The main reason for the content validity inadequacy was the response options not being appropriate (i.e., closed response, rating scales).

Validity.
Table 3 presents the summary of the results for validity, and its details are provided in a supporting file [see S2 Table]. Only the Athens Physical Activity Questionnaire (APAQ) [77] had 'adequate' overall quality for criterion validity and the International Physical Activity Questionnaire (IPAQ) [65] had 'adequate' overall quality for convergent validity. Overall, 36.7% of the studies did not specify the sample characteristics. The most frequently calculated coefficients were Pearson and Spearman correlations, Kappa coefficients, percentages of agreement and intraclass correlation coefficients. Bland and Altman statistics were used to examine precision in 30% of the questionnaires.
In the PA questionnaires, none showed overall 'adequate' criterion or convergent validity. Criterion validity was assessed with accelerometery in 76% of the questionnaires; however, the accelerometer protocols used (e.g., epoch length, valid day definition) varied substantially between studies. The best results with accelerometery were observed for the Self-Report Physical Activity Questionnaire (SPAQ) light, moderate and household PA scores [48], and the Transport and Physical Activity Questionnaire (TPAQ) vigorous PA score [49]. Some questionnaires [34,35][42][43][44] only assessed convergent validity, and this was done against other subjective measures or variables related to PA behaviour (e.g., VO2max, body fat). The CARDIA Physical Activity History (CARDIA) [33], Minnesota Heart Health Program Questionnaire (MHHP Q) [33], 13-Item Physical Activity Questionnaire (13I-PAQ) [42] and Incidental and Planned Exercise Questionnaire (IPEQ) [44] were the ones showing the best convergent validity in some scores.
Regarding SB questionnaires, none showed overall 'adequate' criterion or convergent validity. The accelerometer was the criterion measure in 91.7% of the questionnaires. The Australian Longitudinal Study on Women's Health-Sedentary Behaviour Questions (ALSWH-SB Q) [52] showed the best convergent validity scores; nevertheless, these only took into account computer use (r = 0.74) and occupational SB (ICC = 0.77).
Regarding sleep, the BRFSS Sleep questionnaire [62] was evaluated against criterion and convergent measures; however, its validity quality could not be determined given that only Bland and Altman statistics were performed.
For the questionnaires combining PA and SB, none showed overall 'adequate' criterion validity. For these questionnaires, the accelerometer was the criterion measure in 75% of the questionnaires. Concerning criterion validity, the Sedentary, Transportation and Activity Questionnaire (STAQ) [64] showed the best performance regarding the sitting time at work score (ICC = 0.82), when evaluated against accelerometery. The IPAQ's short form, past and usual week versions, were rated with an 'adequate' overall convergent validity, when compared to the respective long forms [65].
The SIT-Q [75] was the only questionnaire combining SB and sleep and was evaluated against one convergent measure (i.e., the Seven-Day Activity Diary). In this questionnaire, occupational SB was the only score with 'adequate' convergent validity (rho = 0.75).
Concerning the questionnaires combining all movement behaviours, the APAQ [77] showed 'adequate' overall criterion validity against accelerometery for total energy expenditure (rho = 0.84). The Sedentary Time and Activity Reporting Questionnaire (STAR-Q) [78] showed the best performance for convergent validity; this was assessed against a 7-day activity diary (energy expenditure rho = 0.74; general occupational activity rho = 0.71; occupational sitting rho = 0.75; and SB rho = 0.75).

Reliability and measurement error.
Table 3 presents the summary of the reliability results, and its details are provided in a supporting file [see S3 Table]. 'Adequate' overall reliability quality was observed in 37% of the questionnaires: seven PA questionnaires [30,33,42,44,48], four SB questionnaires [51,55,56,59], eight questionnaires combining PA and SB [57,63,65,68,74], and three questionnaires combining PA, SB and sleep [67,77,83]. Sample characteristics for the reliability results were not specified in 42% of the studies. The time between test and retest ranged from two days to one year. The most often used statistical approaches to assess reliability were Pearson and Spearman correlations, ICCs, Kappa coefficients and percentages of agreement.
For measurement error, Bland and Altman plots comparing test and retest were applied in 31.7% of the questionnaires. Measurement error was calculated in 19 (out of 60) questionnaires and all were rated with 'doubtful' overall measurement error quality, because minimal important change was not reported (PA: four in adults [36][37][38]41], two in older adults [43,46] and two in both [49,50]; SB: two in adults [52,54]; PA and SB: three in adults [64][65][66]; SB and sleep: one in adults [75]; and PA, SB and sleep: three in adults [77,79,80]).

Responsiveness.
The details on responsiveness are provided in a supporting file [see S4 Table]. Only one study (Community Healthy Activities Model Program for Seniors; CHAMPS) [69] evaluated responsiveness. The measures had small to moderate effect sizes (0.38 to 0.64), which resulted in an 'adequate' overall responsiveness quality, given that the results of the study were in accordance with its hypothesis.
Table 3 presents the summary of the results of risk of bias and its details are provided in a supporting file [see S5 Table]. The overall rating for risk of bias regarding criterion validity was very low for 72.2% of the studies. The main cause of a high risk of bias was the absence of sensitivity and specificity of dichotomous scores. For the overall reliability risk of bias, 23.6% of the studies were rated with a very low risk of bias. For the overall rating of measurement error risk of bias, 21.1% of the studies were classified with very low risk of bias. The main reasons for a high risk of bias for reliability or measurement error were the inappropriate interval between test and retest and the statistical methods used (e.g., correlations instead of intraclass correlations). For convergent validity, 82.4% of the studies were classified as having an overall very low risk of bias. The only study assessing responsiveness was rated with a medium risk of bias for this measurement property (CHAMPS) [70].

Discussion
This systematic review identified and described questionnaires assessing sleep, SB and PA or the combination of these movement behaviours, in adults and older adults.
We identified 60 questionnaires and described their content and measurement properties. Of these, 25 questionnaires measured PA, 12 SB, one sleep, 12 the combination of PA and SB, one the combination of SB and sleep, and nine the combination of PA, SB, and sleep. Results showed high heterogeneity in the questionnaires' content, measurement properties and quality, which precluded a meta-analysis. Indeed, the questionnaires' content varied substantially in terms of the behaviour domains assessed, response method, measurement units, scores, recall and assessment periods, as well as in the number of items and parameters evaluated.
The validity of the included questionnaires was mostly assessed by comparing the questionnaire with accelerometers, and the quality of validity results was frequently 'inadequate'. This could potentially be due to desirability bias or cognitive issues related to recall or comprehension of the questionnaires [9].
Only one questionnaire (APAQ) [77] measuring the combination of sleep, SB and PA showed 'adequate' overall quality for criterion validity. However, the validation results were only for total energy expenditure, which requires careful interpretation because this outcome has several limitations: energy expenditure depends on factors other than movement behaviours (e.g., resting energy expenditure and thermic effect of food) [6]; accelerometery is not the most appropriate criterion measure to assess energy expenditure [6]; the results for each behaviour cannot be determined; and energy expenditure is not a time-focused variable. In this sense, to assess a given movement behaviour, the actual time spent in it seems to be a better output, and this is the output generated by accelerometery. Although the limitations of accelerometery are well known, it still seems to be one of the best objective criterion measures to assess time spent in movement behaviours in free-living conditions [84]. Likewise, devices combining heart rate monitoring and accelerometery technologies to assess the intensity and time spent in different movement behaviours [6,85] might also be adequate options for validation studies.
Regarding the reliability of the included questionnaires, there were different intervals between test and retest and the overall results' quality was also frequently 'inadequate'. However, these results are dependent on the number of scores that authors evaluated. For example, the PASE [68] was rated with an overall 'adequate' reliability; however, the authors only assessed the reliability for a single score; whereas in more complex questionnaires (i.e., with more scores), such as the WSQ [57], that presented 'adequate' reliability result in a general score (i.e., total, all domains ICC = 0.80), the overall quality was considered 'inadequate', due to the separated scores for reliability. Furthermore, the statistical procedures used by the different studies were often considered 'inadequate', mainly because Pearson or Spearman correlations were used instead of ICCs or Kappas, or because the time interval between test and retest was inappropriate. Indeed, despite Pearson and Spearman correlations do not have into account for systematic errors [29], these have been widely used in validity and reliability studies; however, it is well known that for continuous scores, ICCs are considered more appropriate, while for categorical scores, Kappas are advised [86]. For absolute validity by means and limits of agreement, Bland and Altman plots are recommended [87]; however, these were calculated only in a few of the included studies, either to report on validity or on reliability. Our findings largely contradict the conclusions of the studies included in this review, which considered that the questionnaire under study was valid and reliable, given that these studies used other metrics instead of the COSMIN quality criteria.
The IPAQ [65] simultaneously showed 'adequate' reliability and convergent validity, but not criterion validity. For a questionnaire to be adequately validated, at least 'adequate' overall validity and reliability need to be attained, and a criterion measure is better than a convergent one for that purpose [21].
Responsiveness was only tested for CHAMPS [70]. Other reviews have also reported a lack of responsiveness assessment of questionnaires measuring PA [11,19]. However, assessing questionnaires' responsiveness is paramount to understand whether they are capable of measuring changes in movement behaviours over time [25].
Many questionnaires showed high variability in content, together with inadequate measurement properties, which highlights the complexity of assessing the full spectrum of movement behaviours across the 24h period and reinforces the need for better self-reported questionnaires to measure combinations of movement behaviours. The emergence of the 24h movement guidelines, due to their specific characteristics, raises the need to adapt existing instruments or develop new ones to assess 24h movement behaviours. The same concern has been raised regarding the new WHO PA and SB guidelines for adults [88].
The lack of questionnaires assessing 24h movement behaviours in an integrated way precludes accurate reporting of compliance with the 24h movement behaviour guidelines and of trends over time [89,90], and increases the risk of misclassification and of biased and unreliable results [10]. Moreover, while new guidelines are developed and public health efforts to increase PA and decrease sedentary time proceed, measurement instruments should be improved, and surveillance systems should be adapted and broadly and repeatedly implemented [91]. Indeed, measuring movement behaviours is complex and there is a need for better solutions, mainly to assess all movement behaviours in an integrated fashion. Given the measurement properties and the content of the questionnaires assessing a combination of all movement behaviours presented herein, there seems to be no single questionnaire capable of accurately measuring these behaviours under the new 24h movement paradigm.

Limitations and strengths
We systematically reviewed existing questionnaires that measure all movement behaviours, together or in isolation, in adults and older adults. Comparing questionnaires' measurement properties is complex, given the heterogeneity of the data, including different scores, domains, recall periods, comparison measures and reporting units. For example, the studies using accelerometry data to assess questionnaires' validity applied different epoch lengths, different definitions of (non)wear time and different placement sites. These aspects make comparisons between studies very difficult. Although the use of COSMIN guidelines should be considered a strength of this review, the COSMIN cut points used to evaluate the quality of measurement properties may lead to some loss of information, due to the mechanistic way of analysing data. Also, to the best of our knowledge, this review contains the largest sample of data/questionnaires assessing movement behaviours.

Conclusions
We systematically reviewed existing questionnaires that measure sleep, SB or PA, or their combination, in adults and older adults. There are several questionnaires with different characteristics and outputs for all movement behaviours. The included questionnaires presented frequent methodological limitations that resulted in inadequate validity and reliability scores. Existing questionnaires have insufficient measurement properties, and none was developed considering the 24h movement behaviour paradigm. The lack of valid and reliable questionnaires assessing 24h movement behaviours in an integrated way precludes accurate monitoring and surveillance systems for 24h movement behaviours.